knitr::opts_chunk$set(message = FALSE, warning = FALSE)
library(tidyverse)
Load the Italian olive oils dataset into R from the following location on the course GitHub page, and store it in a data.frame called olive: [https://raw.githubusercontent.com/mateyneykov/315_code_data/master/data/olive_oil.csv]
Brief data description:
areas and nine specific regions in Italy.
For more information on the data, see here.
Note: The units on the variables in this dataset are a little confusing, so don’t worry about including units on your graphs this week.
olive <- read_csv("https://raw.githubusercontent.com/mateyneykov/315_code_data/master/data/olive_oil.csv")
Using the tidyverse style guide.
eujingc_315_theme <- theme_bw() +
theme(axis.text = element_text(size = 12),
plot.title = element_text(size = 22, face = "bold", hjust = 0),
plot.subtitle = element_text(size = 14, face = "italic", hjust = 0),
text = element_text(size = 14, face = "bold", color = "darkslategrey"))
olive_sub <- select(olive, area, region, palmitic, palmitoleic, stearic)
olive_sub <- olive_sub %>%
  mutate(
    # Unnamed replacements in recode() are matched by position to the numeric codes 1, 2, 3, ...
    area = recode(area, "South", "Sardinia", "North"),
    region = recode(region, "North Apulia", "South Apulia", "Calabria",
                    "Sicily", "Umbria", "East Liguria", "West Liguria",
                    "Inland Sardinia", "Coastal Sardinia")
  )
library(GGally)
ggpairs(olive_sub, aes(color = area))
The boxplots of palmitic and palmitoleic conditioned on area show that the South differs noticeably from the other two areas in both variables.
There also seems to be a positive correlation between palmitic and palmitoleic.
ggplot(olive_sub,
aes(x = stearic, y = palmitic,
color = region, shape = area)) +
geom_point() +
labs(x = "Stearic", y = "Palmitic",
color = "Region", shape = "Area",
title = "Palmitic against Stearic") +
eujingc_315_theme
(10 points each)
Density Estimates and Heat Maps
Recreate your graph from part (d) of Problem 1, but this time, overlay a 2-D density estimate of the overall joint distribution of the two continuous variables in the form of a contour plot.
From your graph in (a), are there any modes / groups apparent in the data? Do the groups correspond to any of the areas or regions?
Create a heat map of the two variables you used in your scatterplot from Problem 1, part (d). Use a non-default color scheme in your heatmap.
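The contour overlay and heat map asked for above can be sketched with ggplot2's 2-D density layers. This is only an illustration under stated assumptions: it uses R's built-in `faithful` data as a stand-in for `olive_sub` so that it runs without a download, and the colors are arbitrary choices. For the assignment, the x and y variables would be `stearic` and `palmitic`.

```r
library(ggplot2)

# Stand-in data (assumption): the built-in `faithful` data set replaces
# olive_sub so that the sketch runs without downloading anything.
p_base <- ggplot(faithful, aes(x = eruptions, y = waiting))

# Scatterplot with contour lines of a 2-D kernel density estimate on top.
p_contour <- p_base +
  geom_point(alpha = 0.6) +
  geom_density_2d(color = "darkorange")

# Heat map of the same density estimate with a non-default fill scale.
p_heat <- p_base +
  stat_density_2d(aes(fill = after_stat(density)),
                  geom = "raster", contour = FALSE) +
  scale_fill_gradient(low = "white", high = "darkred")

print(p_contour)
print(p_heat)
```

Separate peaks in the contour lines, or separate dark patches in the heat map, are what would suggest modes or groups in the joint distribution.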
Around the World
gm <- read_csv("https://raw.githubusercontent.com/mateyneykov/315_code_data/master/data/gapminder.csv")
gm_sub <- gm %>% filter(year > 2000) #data we would actually use.
This dataset contains information about life expectancy, population, and GDP per capita collected in 2002 and 2007 for countries around the world. We’re going to use it to explore the relationships between GDP and life expectancy across continents.
(5 points) Create a scatter plot between the log10 transform of GDP per capita and life expectancy, coloring on continent.
(5 points) What patterns can you see? Describe the relationship and joint distribution between lifeExp and log10(gdpPercap) overall and conditional on continent. (Kinda expecting \(> 3\) sentences.)
(5 points) Try fitting a linear regression (`geom_smooth(method = "lm")`) and a non-parametric regression relating the log10 transform of GDP per capita and life expectancy (we only want 1 line per regression). Which do you prefer for this data set and why (which makes more sense / fits the data better)?
(5 points) How should we interpret the regression lines’ confidence bands? In earlier statistics classes you saw confidence intervals for a mean value \(\mu\) and for linear regression coefficients \(\beta\). Similarly, a confidence band is a region that, under the model’s assumptions, would capture the true regression curve in a proportion \(1 - \alpha\) of repeated samples drawn from data for which that regression is the truth. Also recall that the regression function is just the function that defines the mean value of \(Y\) given \(X\).
Nice interpretive properties:
if a confidence band (especially one from a regression function) can contain a flat line (with \(slope = 0\)), then we cannot reject the hypothesis that there is no relationship (or, for a linear regression, no linear relationship) between \(X\) and \(Y\).
if two confidence bands overlap over the entire range of \(X\), then we cannot reject the hypothesis that both groups have the same regression function.
What does level = .98 do in the geom_smooth call (leave this at .95 unless you have a legitimate reason not to)? Using the statements in (d): Is there a statistically significant linear relationship between log10(gdpPercap) and lifeExp for countries in the continent of Asia? Can you conclude that the linear relationship between log10(gdpPercap) and lifeExp for countries in Asia is different from that for countries in Europe?
ggplot(gm_sub %>% filter(continent %in% c("Asia", "Europe")),
aes(x = log10(gdpPercap), y = lifeExp)) +
geom_point(aes(color = continent)) +
geom_smooth(aes(color = continent), method = "lm", level = .98) +
labs(color = "Continent",
x = "log10(GDP per Capita)",
y = "Average Life Expectancy",
title = "Life Expectancy and GDP per Capita")
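The two interpretive properties from part (d) can also be checked numerically instead of being read off the plot. The sketch below is an illustration on simulated data: the variable names, slopes, and noise level are made up and are not taken from the gapminder file. It asks whether the slope's confidence interval contains 0, and it shows that raising the level from .95 to .98 widens the interval. For the second property it swaps in a different technique, an F test comparing one common line against separate lines per group, as a more formal counterpart to checking whether the bands overlap.

```r
set.seed(315)

# Simulated stand-in data (assumption): two groups with different slopes.
n <- 60
sim <- data.frame(
  x = runif(2 * n, min = 2.5, max = 4.5),
  group = rep(c("A", "B"), each = n)
)
sim$y <- ifelse(sim$group == "A", 40 + 8 * sim$x, 55 + 5 * sim$x) +
  rnorm(2 * n, sd = 4)

# Property 1: is a flat line (slope = 0) plausible for group A?
fit_a <- lm(y ~ x, data = subset(sim, group == "A"))
ci_95 <- confint(fit_a, parm = "x", level = 0.95)
ci_98 <- confint(fit_a, parm = "x", level = 0.98)
slope_excludes_zero <- ci_98[1] > 0 || ci_98[2] < 0

# A higher confidence level gives a wider interval for the same fit.
width_95 <- ci_95[2] - ci_95[1]
width_98 <- ci_98[2] - ci_98[1]

# Property 2: do the groups share one regression function? Compare a
# common line against separate lines per group with an F test.
fit_common <- lm(y ~ x, data = sim)
fit_separate <- lm(y ~ x * group, data = sim)
comparison <- anova(fit_common, fit_separate)

print(ci_95)
print(ci_98)
print(comparison)
```

With the real data, the same steps would use `lifeExp`, `log10(gdpPercap)`, and `continent` from `gm_sub` in place of the simulated columns.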